



#### Introduction to Slide Set 10

- In prior side sets we explored hardware only approaches to extracting more performance from software.
- In this slide set, we explore simple software approaches to extracting performance from a single thread. We will do this in the context of a single issue pipelined processor with multicycle floating-point operations (e.g., as in Slide Set 6)
- Then we will look at an example of a real processor that uses Tomasulo's algorithm, multiple instruction issue and a reorder buffer—the Pentium 4.
- Finally, we will consider the limits of instruction level parallelism found in software.



#### Learning Objectives

- After we finish this slide set in lectures you will be able to
  - List two software approaches to enhance instruction level parallelism (scheduling, loop unrolling)
  - Apply loop unrolling to a simple loop for a scalar pipeline.
  - Apply loop unrolling to a simple loop for a VLIW pipeline.
  - Explain why the software based approaches do not completely replace the need for hardware approaches.
  - List one or two details of Pentium 4 microarchitecture.
     Explain how they impact performance of Pentium 4.
  - Explain sources of limits to instruction level parallelism (ILP)



## Running Example

• This code adds a scalar to a vector:

```
for (i=1000; i>0; i=i-1)
x[i] = x[i] + s;
```

• Assume a single-issue, in order pipeline with the following latencies in all examples:

| Instruction producing result | Instruction using result | Execution in cycles | Latency in cycles |
|------------------------------|--------------------------|---------------------|-------------------|
| FP ALU op                    | Another FP ALU op        | 4                   | 3                 |
| FP ALU op                    | Store double             | 3                   | 2                 |
| Load double                  | FP ALU op                | 1                   | 1                 |
| Load double                  | Store double             | 1                   | 0                 |
| Integer op                   | Integer op               | 1                   | 0                 |
| Integer op                   | Branch op                | 1                   |                   |



## Running Exam x[i+1] do we need to know

In this example, to compute **?**[i]x

This code adds a scalar to a vector:

```
for (i=1000; i>0; i=i-1)
     x[i] = x[i] + s;
```

Assume a single-issue, in order pipeline with the following late

A: Yes, very sure

B: Yes, but not sure

C: Not sure

D: No, but not sure

E: No, very sure

| Instruction producing result | Instruction using result | Execution in cycles | Latency in cycles |   |
|------------------------------|--------------------------|---------------------|-------------------|---|
| FP ALU op                    | Another FP ALU op        | 4                   | 3                 |   |
| FP ALU op                    | Store double             | 3                   | 2                 |   |
| Load double                  | FP ALU op                | 1                   | 1                 |   |
| Load double                  | Store double             | 1                   | 0                 |   |
| Integer op                   | Integer op               | 1                   | 0                 |   |
| Integer op                   | Branch op                | 1                   |                   | 1 |
|                              |                          |                     |                   |   |



Running Exam In this example, to compute x[i+1] do we need to know **?**[i]x

This code adds a scalar to a vector:

```
for (i=1000; i>0; i=i-1)
     x[i] = x[i] + s;
```

Assume a single-issue, in order pipeline with the following late

A: Yes, very sure

B: Yes, but not sure

C: Not sure

D: No, but not sure

E: No, very sure √

| Instruction producing result | Instruction using result | Execution in cycles | Latency in cycles |   |
|------------------------------|--------------------------|---------------------|-------------------|---|
| FP ALU op                    | Another FP ALU op        | 4                   | 3                 |   |
| FP ALU op                    | Store double             | 3                   | 2                 |   |
| Load double                  | FP ALU op                | 1                   | 1                 |   |
| Load double                  | Store double             | 1                   | 0                 |   |
| Integer op                   | Integer op               | 1                   | 0                 |   |
| Integer op                   | Branch op                | 1                   |                   | 1 |



## Running Example

• This code adds a scalar to a vector:

```
for (i=1000; i>0; i=i-1)
x[i] = x[i] + s;
```

• Assume a single-issue, in order pipeline with the following latencies in all examples:

| Instruction producing result | Instruction using result | Execution in cycles | Latency in cycles |
|------------------------------|--------------------------|---------------------|-------------------|
| FP ALU op                    | Another FP ALU op        | 4                   | 3                 |
| FP ALU op                    | Store double             | 3                   | 2                 |
| Load double                  | FP ALU op                | 1                   | 1                 |
| Load double                  | Store double             | 1                   | 0                 |
| Integer op                   | Integer op               | 1                   | 0                 |
| Integer op                   | Branch op                | 1                   |                   |



#### FP Loop: Where are the Hazards?

First translate into MIPS code:

```
Loop: L.D F0,0(R1) ; Regs[F0]=vector element ADD.D F4,F0,F2 ; add scalar from Regs[F2] S.D F4,0(R1) ; store result DADDI R1,R1,#-8 ; decrement pointer 8B (DW) BNE R1,R2,Loop ; branch Regs[R1]!=Regs[R2]
```



#### FP Loop: Where are the Hazards?

First translate into MIPS code:

```
Loop: L.D F0,0(R1) ; Regs[F0]=vector element ADD.D F4,F0,F2 ; add scalar from Regs[F2] S.D F4,0(R1) ; store result DADDI R1,R1,#-8 ; decrement pointer 8B (DW) BNE R1,R2,Loop ; branch Regs[R1]!=Regs[R2]
```



#### FP Loop: Where ar

How many true data dependencies are there in one iteration of this loop?

First translate into MIPS code:

L.D

ADD.D

Loop:

```
B: 1
C: 2
D: 3
```

F0,0(R1); F4,F0,F2; E: Not sure

```
S.D F4,0(R1); store result

DADDI R1,R1,#-8; decrement pointer 8B (DW)

BNE R1,R2,Loop; branch Regs[R1]!=Regs[R2]
```



#### FP Loop: Where ar

How many true data dependencies are there in one iteration of this loop?

First translate into MIPS code:

```
Loop: L.D F0,0(R1);
ADD.D F4,F0,F2;
S.D F4,0(R1); store result
DADDI R1,R1,#-8; decrement pointer 8B (DW)
BNE R1,R2,Loop; branch Regs[R1]!=Regs[R2]
```



#### FP Loop: Where are the Hazards?

First translate into MIPS code:

```
Loop: L.D F0,0(R1) ; Regs[F0]=vector element ADD.D F4,F0,F2 ; add scalar from Regs[F2] S.D F4,0(R1) ; store result DADDI R1,R1,#-8 ; decrement pointer 8B (DW) BNE R1,R2,Loop ; branch Regs[R1]!=Regs[R2]
```



# FP Loop: Showing Stalls (in order, single-issue pipeline)

```
F0,0(R1); Regs[F0]=vector element
1 Loop: L.D
2
        stall .
3
        ADD.D F4,F0,F2; add scalar in Regs[F2]
        stall 
4
5
        stall
6
        S.D F4,0(R1); store result
        DADDI R1,R1,#-8; decrement pointer 8B (DW)
8
        stall
9
               R1,R2,Loop; branch Regs[R1]!=Regs[R2]
        BNE
```

| Instruction producing result | Instruction using result | Latency in clock cycles |
|------------------------------|--------------------------|-------------------------|
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |
| Integer op                   | Branch op                | 1                       |



# FP Loop: Showing Stalls (in order, single-issue pipeline)

| Instruction producing result | Instruction using result | Latency in clock cycles |
|------------------------------|--------------------------|-------------------------|
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |
| Integen on                   | Branch on                | 1                       |

• 9 clocks: Rewrite code to minimize stalls?



# FP Loop: Showin dependencies with other (in order, single-issue pinstructions in the loop

Does DADDUI have any data instructions in the loop ; Regs[F0]=ve besides the BNE instruction?

```
F0,0(R1)
 Loop:
         L.D
         stall
         ADD.D
               F4,F0,F2; add scalar
         stall 
         stall.
                F4,0(R1)
                           ; store resu
6
         S.D
                R1,R1,#-8; decrement pointer 8B (DW)
         DADDI
8
         stall.
                R1,R2,Loop; branch Regs[R1]!=Regs[R2]
         BNE
```

| Instruction producing result | Instruction using result | Latency in clock cycles |
|------------------------------|--------------------------|-------------------------|
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |
| Integer op                   | Branch op                | 1                       |

9 clocks: Rewrite code to minimize stalls?



# FP Loop: Showin dependencies with other (in order, single-issue pinstructions in the loop

Does DADDUI have any data instructions in the loop

```
F0,0(R1); Regs[F0]=ve besides the BNE instruction?
 Loop:
        L.D
         stall.
        ADD.D
               F4,F0,F2; add scalar
         stall 
         stall.
                F4,0(R1)
                           ; store resu
6
         S.D
        DADDI
                R1,R1,#-8; decrement pointer 8B (DW)
8
         stall.
                R1,R2,Loop; branch Regs[R1]!=Regs[R2]
         BNE
```

| Instruction      | Instruction       | Latency in   |
|------------------|-------------------|--------------|
| producing result | using result      | clock cycles |
| FP ALU op        | Another FP ALU op | 3            |
| FP ALU op        | Store double      | 2            |
| Load double      | FP ALU op         | 1            |
| Integer op       | Branch op         | 1            |

9 clocks: Rewrite code to minimize stalls?



# FP Loop: Showing Stalls (in order, single-issue pipeline)

| Instruction producing result | Instruction using result | Latency in clock cycles |
|------------------------------|--------------------------|-------------------------|
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |
| Integen on                   | Branch on                | 1                       |

• 9 clocks: Rewrite code to minimize stalls?



# FP Loop: Showin DADDI before 5.D and (in order, single-issue pensure code works same

Is there \*any\* way to move DADDI before S.D and ensure code works same as it did before?

A: Yes

```
F0,0(R1); Regs[F0]=ve did before?
1 Loop:
        L.D
         stall.
        ADD.D
               F4,F0,F2; add scalar
         stall 
         stall.
                F4,0(R1)
                           ; store resu
6
        S.D
        DADDI R1,R1,#-8; decrement pointer 8B (DW)
8
        stall.
                R1,R2,Loop; branch Regs[R1]!=Regs[R2]
        BNE
```

| Instruction producing result | Instruction using result | Latency in clock cycles |
|------------------------------|--------------------------|-------------------------|
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |
| Integer op                   | Branch op                | 1                       |

9 clocks: Rewrite code to minimize stalls?



# FP Loop: Showin DADDI before 5.D and (in order, single-issue pensure code works same

Is there \*any\* way to move
DADDI before S.D and
ensure code works same as it
did before?

A: Yes \
B: Not sure

```
F0,0(R1); Regs[F0]=ve did before?
1 Loop:
        L.D
        stall.
        ADD.D
               F4,F0,F2; add scalar
        stall 
        stall.
                F4,0(R1)
                           ; store resul
6
        S.D
        DADDI R1,R1,#-8; decrement pointer 8B (DW)
8
        stall.
                R1,R2,Loop; branch Regs[R1]!=Regs[R2]
        BNE
```

| Instruction producing result | Instruction using result | Latency in clock cycles |
|------------------------------|--------------------------|-------------------------|
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |
| Integer op                   | Branch op                | 1                       |

9 clocks: Rewrite code to minimize stalls?



# FP Loop: Showing Stalls (in order, single-issue pipeline)

| Instruction producing result | Instruction using result | Latency in clock cycles |
|------------------------------|--------------------------|-------------------------|
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |
| Integen on                   | Branch on                | 1                       |

• 9 clocks: Rewrite code to minimize stalls?



#### Revised FP Loop Minimizing Stalls

```
Loop: L.D
              F0,0(R1)
2
               R1,R1,#-8
       DADDI
3
              F4,F0,F2
       ADD.D
4
        stall 
5
       stall
6
       S.D
              F4,8(R1)
       BNEZ
              R1,Loop
```

Swap DADDUI and S.D.



### Revised FP Loop Minimizing Stalls

```
F0,0(R1)
 Loop: L.D
2
               R1,R1,#-8
        DADDI
3
        ADD.D
               F4,F0,F2
        stall 
4
5
        stall.
6
        S.D
               F4,8(R1)
        BNEZ
               R1,Loop
```

#### Swap DADDUI and S.D.

7 clocks, but just 3 for execution (L.D,ADD.D,S.D), 4 for loop overhead (DADDI, BNE, two stalls);



### Revised FP Loop Minimizing Stalls

```
Loop: L.D
               F0,0(R1)
2
               R1,R1,#-8
        DADDI
3
        ADD.D
               F4,F0,F2
        stall 
4
5
        stall.
6
        S.D
               F4,8(R1)
        BNEZ
               R1,Loop
```

#### Swap DADDUI and S.D.

7 clocks, but just 3 for execution (L.D,ADD.D,S.D), 4 for loop overhead (DADDI, BNE, two stalls);

How to make faster?



### **Loop Unrolling**

- We can reduce loop overhead by making multiple copies of loop body inside loop.
- This is known as loop unrolling.
- In the next few slides we will look at how to apply loop unrolling to the example on the last few slides.

- The steps we will apply are:
  - 1. Make copy of loop body and remove overhead instructions
  - 2. Rename registers in the assembly code to remove name dependencies (unlike Tomasulo's algorithm, this renaming is done in software rather than in hardware)
  - 3. Schedule instructions to reduce stalls ensuring that new code computes same result as original code.



# Unroll Loop Four Times (straightforward way)

```
__1 cycle stall
                F0,0(R1)
  Loop: L.D
2
                F4,F0,F2
        ADD.D

2 cycles stall
;drop DADDUI & BNE
        S.D
                F4,0(R1)
4
        L.D
                F0, -8(R1)
5
                F4,F0,F2
        ADD.D
6
        S.D
                F4,-8(R1
                                ;drop DADDUI & BNE
                F0, -16(R1)
        L.D
8
        ADD.D
                F4,F0,F2
9
                F4,-16(R1)
        S.D
                                ;drop DADDUI & BNE
10
                F0, -24(R1)
        L.D
11
                F4,F0,F2
        ADD.D
12
        S.D
                F4,-24(R1)
13
        DADDUI R1,R1,#-32
                                ;alter to -4*8
14
                R1, R2, LOOP
        BNE
```

1 cycle stall  $14 + 4 \times (1+2) + 1 = 27$  clock cycles, or 6.75 cycles per iteration Assumes # of loop iterations is multiple of 4



# Unroll Loop Four Times (straightforward way)

```
—1 cycle stall
                 F0,0(R1)
  Loop: L.D
                                                      Rewrite loop to
2
                 F4,F0,F2
        ADD.D

// 2 cycles stall
;drop DADDUI & BNE
                                                        minimize stalls?
        S.D
                 F4,0(R1)
        L.D
                 F0, -8(R1)
5
                 F4,F0,F2
        ADD.D
6
        S.D
                 F4,-8(R1
                                 ;drop DADDUI & BNE
                 F0, -16(R1)
        L.D
8
        ADD.D
                F4,F0,F2
9
        S.D
                F4,-16(R1)
                                 ;drop DADDUI & BNE
10
                F0, -24(R1)
        L.D
11
                F4,F0,F2
        ADD.D
12
        S.D
                F4, -24(R1)
13
        DADDUI
                R1,R1,#-32
                                 ;alter to -4*8
14
                 R1, R2, LOOP
        BNE
```

1 cycle stall  $14 + 4 \times (1+2) + 1 = 27$  clock cycles, or 6.75 cycles per iteration Assumes # of loop iterations is multiple of 4



#### Where are the name dependencies?

```
Loop:L.D F0,0(R1)
2
      ADD.D F4,F0,F2
3
      S.D 0(R1), F4
                          ;drop DADDUI & BNE
4
      L.D F0, -8(R1)
5
      ADD.D F4,F0,F2
6
      S.D -8 (R1), F4
                          ;drop DADDUI & BNE
      L.D F0, -16(R1)
8
      ADD.D F4, F0, F2
9
      S.D -16(R1), F4
                         ;drop DADDUI & BNE
10
      L.D F0, -24(R1)
11
      ADD.D F4,F0,F2
12
      S.D -24(R1), F4
13
      DADDUI R1,R1,#-32 ;alter to 4*8
14
             R1,R2,LOOP
      BNE
```



#### Where are the nam

Ignoring name dependencies due to R1, which of the following is closest to the number of name dependencies in the code below?

```
Loop:L.D
               F0,0(R1)
2
               F4,F0,F2
       ADD.D
3
               0(R1),F4
       S.D
4
       L.D
               F0, -8(R1)
5
       ADD.D F4,F0,F2
                              ; dr D: 15
6
       S.D
               -8(R1), F4
               F0, -16(R1)
       L.D
                                  E: What is a name dependency again?
8
       ADD.D F4,F0,F2
9
       S.D
               -16(R1), F4
                              ;drop DADDUI & BNE
10
       L.D
               F0, -24(R1)
11
       ADD.D F4,F0,F2
12
               -24(R1), F4
       S.D
13
       DADDUI R1,R1,#-32
                              ;alter to 4*8
14
               R1,R2,LOOP
       BNE
```



#### Where are the nam

Ignoring name dependencies due to R1, which of the following is closest to the number of name dependencies in the code below?

```
Loop:L.D
               F0,0(R1)
2
               F4,F0,F2
       ADD.D
3
               0(R1),F4
        S.D
4
       L.D
               F0, -8(R1)
                                   C: 12 = (6 output) + (6 anti)
5
       ADD.D F4,F0,F2
                              ; dr D: 15
6
        S.D
               -8(R1), F4
               F0, -16(R1)
       L.D
                                   E: What is a name dependency again?
8
       ADD.D F4,F0,F2
9
        S.D
               -16(R1), F4
                               ;drop DADDUI & BNE
               F0, -24(R1)
10
       L.D
11
       ADD.D F4,F0,F2
12
               -24(R1), F4
        S.D
13
       DADDUI R1,R1,#-32
                              ;alter to 4*8
14
               R1,R2,LOOP
        BNE
```



#### Where are the nam

Ignoring name dependencies due to R1, which of the following is closest to the number of name dependencies in the code below?

```
Loop:L.D
                F0,0(R1)
               F4,F0,F2
2
        ADD.D
                                     A: 6
3
               70(🔼),F4
        S.D
                                 ; dr
4
        L.D
                                     C: 12 = (6 output) + (6 anti)
5
        ADD.D \angle F4/F0, F2
                                 ; dr D: 15
6
        S.D
        L.D
                                     E: What is a name dependency again?
8
        ADD.D
9
                   64R1), F4
        S.D
                                 ;drop DADDUI
10
        L.D
11
               F4, F9, F2
        ADD.D
                 -24 (R1),F4
12
        S.D
13
                R1,R1,#-32
        DADDUI
                                 ;alter to 4*8
14
                R1,R2,LOOP
        BNE
```

anti-dependencies on R1 omitted for clarity



#### Where are the name dependencies?

```
Loop:L.D
                F0,0(R1)
2
        ADD.D
3
        S.D
                               ;drop DADDUI & BNE
4
        L.D
5
        ADD.D \angle F4/F0, F2
6
        S.D
                               ;drop DADDUI
                                              & BNE
        L.D
8
        ADD.D
9
        S.D
                  64R1), F4
                               ;drop DADDUI & BNE
10
        L.D
11
        ADD.D
               F4, E9, F2
12
                -24(R1),F4
        S.D
13
        DADDUI R1, R1, #-32
                               ;alter to 4*8
14
                R1,R2,LOOP
        BNE
```

How can we remove them?

anti-dependencies on R1 omitted for clarity



#### Apply Renaming

```
1 Loop:L.D \mathbf{F0}, 0 (R1)
2
       ADD.D F4, F0, F2
3
       S.D \quad O(R1), F4
                           ;drop DADDUI & BNE
4
       L.D F6, -8 (R1)
      ADD.D F8, F6, F2
5
6
       S.D -8 (R1), F8
                          ;drop DADDUI & BNE
7
       L.D F10, -16(R1)
8
      ADD.D F12, F10, F2
9
       S.D -16(R1), F12 ; drop DADDUI & BNE
10
       L.D F14, -24(R1)
11
       ADD.D F16, F14, F2
12
       S.D -24(R1), F16
13
       DADDUI R1,R1,#-32 ;alter to 4*8
14
       BNE
             R1,R2,LOOP
```



#### **Unrolled Loop That Minimizes Stalls**

```
F0,0(R1)
 Loop:L.D
2
      L.D
             F6, -8(R1)
3
      L.D
             F10,-16(R1)
4
      L.D
             F14,-24(R1)
5
      ADD.D
             F4,F0,F2
6
             F8, F6, F2
      ADD.D
7
             F12,F10,F2
      ADD.D
8
             F16,F14,F2
      ADD.D
9
      S.D
             0(R1), F4
10
             -8(R1),F8
      S.D
11
      DADDUI R1,R1,#-32
12
             16(R1), F12; 16 = -16 + 32
      S.D
13
      S.D
             8(R1), F16 ; 8 = -24 + 32
14
      BNE
             R1,R2, LOOP
```

14 clock cycles, or 3.5 per iteration



#### **Unrolled Loop That Minimizes Stalls**

```
    Assumptions we made:

               F0,0(R1)
  Loop:L.D

    OK to move store past

2
       L.D
               F6, -8 (R1)
                                           DADDUI even though
3
       L.D
               F10,-16(R1)
                                           DADDUI changes register
4
       L.D
               F14,-24(R1)

    OK to move loads before

5
       ADD.D
               F4,F0,F2
                                           stores: Needed to check if we
6
               F8, F6, F2
       ADD.D
                                           get the right data
7

    In general, compiler needs to

       ADD.D
               F12,F10,F2
                                           check for both register and
8
       ADD.D
               F16,F14,F2
                                           memory dependencies
9
               0(R1),F4
       S.D
10
       S.D
               -8(R1),F8
11
       DADDUI R1,R1,#-32
12
       S.D
               16(R1), F12; 16 = -16 + 32
13
               8(R1), F16 ; 8 = -24 + 32
       S.D
14
       BNE
               R1,R2, LOOP
```

14 clock cycles, or 3.5 per iteration



# Compiler Based Scheduling (Static Scheduling)

- Compiler analyzes dependencies in program
- Tries to schedule to avoid hazards that cause performance losses
- Easy to determine for registers (fixed names)
- Hard for memory ("memory disambiguation" problem):
  - Example:

SD R1, 100(R4) LD R2, 20(R6)

- Is there a data dependence?
- Equivalent question: "Does 100(R4) = 20(R6)"?
- Possible answers: definitely yes, maybe, definitely no
- If "by inspection" we can see the answer is "definitely no", we can move LD before SD. Compilers answer this question using complex algorithms beyond the scope of this course



#### Compiler Perspectives on Code Movement



# Compiler Perspectives on Code Movement

Our example required compiler to know that if R1 doesn't change then:

```
0 (R1) \neq -8 (R1) \neq -16 (R1) \neq -24 (R1)
```



# Compiler Perspectives on Code Movement

Our example required compiler to know that if R1 doesn't change then:

```
0 (R1) \neq -8 (R1) \neq -16 (R1) \neq -24 (R1)
```

There were no dependencies between some loads and stores so they could be moved by each other.



### Details Steps to Unroll

- Determine unrolling the loop would be useful by finding that the loop iterations were independent
- Eliminate extra test and branch instructions and adjust the loop termination and iteration code
- Rename registers to avoid name dependencies
- Determine loads and stores in unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent
  - Requires analyzing memory addresses and finding that they do not refer to the same address.
- Schedule the code, preserving any dependences needed to yield same result as the original code



#### Unknown Loop Bounds

- Do not usually know upper bound of loop
- Suppose it is "n", and we would like to unroll the loop to make "k" copies of the body
- Instead of a single unrolled loop, we generate a pair of consecutive loops:
  - 1st executes (n mod k) times and has a body that is the original loop
  - 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
  - For large values of n, most of the execution time will be spent in the unrolled loop



# Example

(using C syntax for clarity, not fully optimized)

```
// Original Loop
for(i=0; i < n; i++) {
  A[i] = B[i]:
// After applying loop unrolling:
// first loop : original loop body, iterates n mod k times
for(i=0; i < n\%4; i++) {
  A[i]=B[i];
// second loop: unroll original loop 4 times, iterates n/k times
for( i=0; i < n/4; i++ ) {
  A[4*i + n%4 + 0] = B[4*i + n%4 + 0];
  A[4*i + n%4 + 1] = B[4*i + n%4 + 1];
  A[4*i + n%4 + 2] = B[4*i + n%4 + 2];
  A[4*i + n\%4 + 3] = B[4*i + n\%4 + 3]:
```



# Very Long Instruction Word (VLIW)

VLIW: instructions that "encode" multiple operations.

The hardware executes the entire instruction "at once" on parallel function units in execute stage.

The "long instructions" encode the fact that the operations are <u>independent</u>. So, the hardware does not need to dynamically figure this out. This can save area and power and is now popular in embedded processors (e.g., Texas Instruments C6x series of DSPs).

Inst N: Inst N+1: ADD.D F1,F1,F4NOP L.D F7, O(R1) DSUBI R2,R2,#1 NOP

ADD.D F1,F1,F7MUL.D F4,F5,F6 NOP DSUBI R2,R2,#1 BEQZ R2, Loop





# Loop Unrolling in VLIW

| Memory          | Memory          | FP               | FP       | Int. op/    | Clock  |
|-----------------|-----------------|------------------|----------|-------------|--------|
| reference 1     | reference 2     | operation 1      | op. 2    | branch      |        |
| L.D F0,0(R1)    | L.D F6,-8(R1)   |                  |          |             |        |
| L.D F10,-16(R1) | L.D F14, 24(R1) |                  |          |             |        |
| L.D F18,-32(R1) | L.D F22,-40(R1) | ADD.D F4, ₹0, F2 | ADD.D F  | 8,F6,F2     |        |
| L.D F26,-48(R1) | _               | ADD.D F12,F10,F2 | ADD.D F  | 16,F14,F2   |        |
|                 |                 | ADD.D F20,F18,F2 | ADD.D F2 | 4,F22,F2    |        |
| S.D 0(R1),F4    | S.D -8(R1),F8   | ADD.D F28,F26,F2 |          |             |        |
| S.D -16(R1),F12 | S.D -24(R1),F16 |                  |          |             |        |
| S.D -32(R1),F20 | S.D -40(R1),F24 |                  |          | DSUBUI R1,I | R1,#48 |
| S.D -0(R1),F28  |                 |                  |          | BNEZ R1,LO  | OP     |

Unrolled 7 times to avoid delays 7 results in 9 clocks, or 1.3 clocks per iteration (~2.7X) 23 ops / 9 cycles = 2.5 ops per clock, 2.5/5 = 50% efficiency Note: Need more registers in VLIW (15 vs. 6)



# Pentium 4 History

- Willamette (Nov 2000)
  - Process tèchnology: 180 nm
  - Pipeline stages: 21
- Northwood (Jan 2002)
  - Process technology: 130 nm
  - Pipeline stages: 21
  - Hyperthreading support added (or finally debugged) for later versions
- Prescott (Feb 2004)
  - Process technology: 90 nm
  - Pipeline stages: 31
  - Hyperthreading
- Cedar Mill (early 2006)
  - Prescott, but in 65 nm
  - End of road for Pentium 4 Microarchitecture... consuming too much power

[source (for dates): Wikipedia]



### Pentium 4 ("Prescott") Block Diagram



© 2007 Elsevier, Inc. All rights reserved.



# Pentium 4 Pipeline ("Willamette")

| В   | asi | c Pe       | ntiu      | m III | Pr        | oces   | sor M       | ispre | dictio       | n Pi         | pelir       | 16 |
|-----|-----|------------|-----------|-------|-----------|--------|-------------|-------|--------------|--------------|-------------|----|
| fet | ch  | 2<br>Fetch | 3<br>Deco | de De | 4<br>code | 100000 | 6<br>Rename | 1000  | 8<br>Rdy/Sch | 9<br>Dispatc | 10<br>h Exe |    |
|     |     |            |           |       |           |        |             |       |              |              |             |    |
| В   | asi | c Pe       | ntiu      | m 4   |           |        | or Mi       | •     | dictio       | n Pip        |             | e  |



[Source: "The Microarchitecture of the Pentium® 4 Processor", Intel Technology Journal, Vol. 5 Issue 1 (February 2001) special issue on the Pentium® 4 Processor]



# Register Renaming



[Source: "The Microarchitecture of the Pentium® 4 Processor", Intel Technology Journal, Vol. 5 Issue 1 (February 2001) special issue on the Pentium® 4 Processor]



# Branch Mispredictions





# Branch Mispredictions



Floating point benchmarks: Less branches, and branches are more predictable.



Instructions "thrown out" due to branch mispredictions (or jump target mispredictions)



Note similarity to branch prediction accuracy



#### Cache Misses



L2 cache misses more important for performance... why?



# Pentium 4 CPI v.s AMD Opteron CPI





#### Pentium 4 CPI v.s AMD

Which <u>increases</u> performance:

A: Lower CPI





gzip

#### Pentium 4 CPI v.s AMD performance:

Which <u>increases</u>

Lower CPI ✓

Higher CPI





# Pentium 4 CPI v.s AMD Opteron CPI





gzip

#### Pentium 4 CPI v.s AMD

Taking into account clock frequency difference (see below) which do you think is fastest?





gzip

#### Pentium 4 CPI v.s AMD frequency difference (see

Taking into account clock frequency difference (see below) which do you think is fastest?





# Pentium 4 CPI v.s AMD Opteron CPI





### Pentium 4 Performance v.s AMD Opteron Performance



Opteron slightly faster overall: Higher clock frequency not enough to overcome reduction in CPI due to increased stalls



# Limits to ILP (Instruction Level Parallelism)



# Limits to ILP (Instruction Level Parallelism)

 How much ILP is available using superscalar OoO mechanisms with increasing HW budgets?



# Limits to ILP (Instruction Level Parallelism)

- How much ILP is available using superscalar OoO mechanisms with increasing HW budgets?
- Do we need to invent new HW/SW mechanisms to keep on processor performance curve?
  - Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints
  - GPUs/Cell Processor?
  - Other?





Initial HW Model here; MIPS.



Initial HW Model here; MIPS.



Initial HW Model here; MIPS.

Assumptions for ideal/perfect machine to start:

1. Register renaming – infinite virtual registers => all register WAW & WAR hazards are avoided



Initial HW Model here; MIPS.

- 1. Register renaming infinite virtual registers => all register WAW & WAR hazards are avoided
- 2. *Branch prediction* perfect; no mispredictions



Initial HW Model here; MIPS.

- Register renaming infinite virtual registers => all register WAW & WAR hazards are avoided
- 2. *Branch prediction* perfect; no mispredictions
- 3. Jump prediction all jumps perfectly predicted 2 & 3 => machine with perfect speculation & an unbounded buffer of instructions available



Initial HW Model here; MIPS.

- Register renaming infinite virtual registers => all register WAW & WAR hazards are avoided
- 2. *Branch prediction* perfect; no mispredictions
- 3. Jump prediction all jumps perfectly predicted 2 & 3 => machine with perfect speculation & an unbounded buffer of instructions available
- 4. *Memory-address alias analysis* addresses are known & a store can be moved before a load provided addresses not equal



Initial HW Model here; MIPS.

Assumptions for ideal/perfect machine to start:

- 1. Register renaming infinite virtual registers => all register WAW & WAR hazards are avoided
- 2. *Branch prediction* perfect; no mispredictions
- 3. Jump prediction all jumps perfectly predicted 2 & 3 => machine with perfect speculation & an unbounded buffer of instructions available
- 4. *Memory-address alias analysis* addresses are known & a store can be moved before a load provided addresses not equal

#### Also:

- unlimited number of instructions issued/clock cycle; perfect caches;
- 1 cycle latency for all instructions



# Upper Limit to Instruction Level Parallelism: Ideal Machine





#### Impact of Instruction Window Size





# More Realistic HW: Branch Impact

Change from Infinite window to examine to 2000 and maximum issue of 64 instructions per clock cycle





# **Beyond Branch Prediction**

Control flow very often "re-converges" shortly after a branch.

```
y = lookup(x); // A
if(y > 0) {
    B++;
} else {
    C++;
}
foo(); // D
```



- In 1995-2005 timeframe: much research on "speculative threads". E.g., start short thread at D while executing at A. When original thread reaches D it combines state with speculative thread.
- Huge benefit: reduces sensitivity to branch mispredictions.
- Sun Microsystems "Rock" processor implemented a simple form of this and it seem likely Intel or AMD will eventually implement it since Amdahl's Law suggests single thread performance still matters.



# Impact of Number of Renaming Registers

2000 instruction window, maximum 64 inst/cycle, tournament predictor Floating-point more sensitive to number of rename registers.





# Impact of Memory Dependence Prediction

2000 instr window, 64 instr issue, 8K 2 level Prediction, 256 renaming registers (potential impact of memory "dependence prediction")









































# Yale Patt The University of Texas at Austin

World University Presidents' Symposium
University of Belgrade
April 4, 2009



# Future Microprocessors: What must we do differently to effectively utilize multi-core and many-core chips?

...and what are the implications re: education?

# Yale Patt The University of Texas at Austin

World University Presidents' Symposium
University of Belgrade
April 4, 2009











































































#### Intel Core 2 Duo

- Penryn, 2007
- 45nm, 3MB L2







• It is easier than designing a much better uni-core



- It is easier than designing a much better uni-core
- It is embarrassing to continue making L2 bigger



- It is easier than designing a much better uni-core
- It is embarrassing to continue making L2 bigger
- It is the next obvious step



- It is easier than designing a much better uni-core
- It is embarrassing to continue making L2 bigger
- It is the next obvious step
- It is not the Holy Grail



- It is easier than designing a much better uni-core
- It is embarrassing to continue making L2 bigger
- It is the next obvious step
- It is not the Holy Grail





• In the beginning: a better and better uniprocessor – improving performance on the hard problems



- In the beginning: a better and better uniprocessor improving performance on the hard problems
- More recently: a uniprocessor with a bigger L2 cache
   forsaking further improvement on the "hard" problems

  - poorly utilizing the chip area
  - and blaming the processor for not delivering performance



- In the beginning: a better and better uniprocessor
  - improving performance on the hard problems
- More recently: a uniprocessor with a bigger L2 cache
   forsaking further improvement on the "hard" problems

  - poorly utilizing the chip area
  - and blaming the processor for not delivering performance
- Today: dual core, quad core



- In the beginning: a better and better uniprocessor
  - improving performance on the hard problems
- More recently: a uniprocessor with a bigger L2 cache
   forsaking further improvement on the "hard" problems

  - poorly utilizing the chip area
  - and blaming the processor for not delivering performance
- Today: dual core, quad core



- In the beginning: a better and better uniprocessor
  - improving performance on the hard problems
- More recently: a uniprocessor with a bigger L2 cache
   forsaking further improvement on the "hard" problems

  - poorly utilizing the chip area
  - and blaming the processor for not delivering performance
- Today: dual core, quad core
- *Tomorrow:* ???





#### The Good News: Lots of cores on the chip



#### The Good News: Lots of cores on the chip

The Bad News: Not much benefit.



# Evolution of Thread-Level Parallelism in Desktop Applications

**Geoffrey Blake\***, Ronald G. Dreslinski\*, Trevor Mudge\*, Krisztián Flautner†

University of Michigan – Ann Arbor\*
ARM †





- 2000
  - Single core machines common
  - Clock speed steadily increasing Intel predicts to reach 10GHz by 2010





- 2000
  - Single core machines common
  - Clock speed steadily increasing Intel predicts to reach 10GHz by
     2010
- 2005
  - Consumer CMPs announced (from Intel/AMD)
  - Aggressive clock speed increases halt





- 2000
  - Single core machines common
  - Clock speed steadily increasing Intel predicts to reach 10GHz by
     2010
- 2005
  - Consumer CMPs announced (from Intel/AMD)
  - Aggressive clock speed increases halt
- 2010 and Future
  - Multi-core machines common
  - Core counts steadily increasing





- 2000
  - Single core machines common
  - Clock speed steadily increasing Intel predicts to reach 10GHz by
     2010
- 2005
  - Consumer CMDs appeared (from Intel/AMD)
  - Aç
     "If you build it, they will come"
- -- "Field of Dreams" 1989
  - Muili-core machines common
  - Core counts steadily increasing





#### Are threads used?

| Benchmark      | Threads<br>Created | Avg. Threads<br>Alive |
|----------------|--------------------|-----------------------|
| Handbrake 0.9  | 22511              | 24                    |
| Call of Duty 4 | 77                 | 44                    |
| Photoshop CS4  | 82                 | 75                    |
| Adobe Reader 9 | 239                | 24                    |
| Quicktime-HD   | 53                 | 52                    |
| Firefox 3.5    | 522                | 38                    |

- Many threads created
- Many threads alive and visible to the OS during runtime



#### Are threads used?

| Benchmark      | Threads<br>Created | Avg. Threads<br>Alive |
|----------------|--------------------|-----------------------|
| Handbrake 0.9  | 22511              | 24                    |
| Call of Duty 4 | 77                 | 44                    |
| Photoshop CS4  | 82                 | 75                    |
| Adobe Reader 9 | 239                | 24                    |
| Quicktime-HD   | 53                 | 52                    |
| Firefox 3.5    | 522                | 38                    |

- Many threads created
- Many threads alive and visible to the OS during runtime



#### Are threads used?

| Benchmark Handbrake 0.9 | Threads<br>Created<br>22511 | Avg. Threads Alive |
|-------------------------|-----------------------------|--------------------|
| Call of Duty 4          | 77                          | 44                 |
| Photoshop CS4           | 82                          | 75                 |
| Adobe Reader 9          | 239                         | 24                 |
| Quicktime-HD            | 53                          | 52                 |
| Firefox 3.5             | 522                         | 38                 |

- Many threads created
- Many threads alive and visible to the OS during runtime

















Time (3s)







#### Firefox 3.5



Time (3s)











Time (3s)











Time (3s)

**CPUO** 











# UBC

#### Overall TLP Results – Xeon Windows 7



# UBC

#### Overall TLP Results – Xeon Windows 7



# UBC

#### Overall TLP Results – Xeon Windows 7





# Why Instruction Level Parallelism <a href="Still">Still</a> Matters (Even With Multicore)

Assuming application is 99% parallelizable, what is better: More simple cores or improve IPC for sequential fraction (e.g., using superscalar or other "cleverness")?

